Accurate travel time estimation is paramount for providing transit users with reliable schedules and dependable real-time information. This paper is the first to utilize roadside urban imagery for direct transit travel time prediction. We propose and evaluate an end-to-end framework integrating traditional transit data sources with a roadside camera for automated roadside image data acquisition, labeling, and model training to predict transit travel times across a segment of interest. First, we show how the GTFS real-time data can be utilized as an efficient activation mechanism for a roadside camera unit monitoring a segment of interest. Second, AVL data is utilized to generate ground truth labels for the acquired images based on the observed transit travel time percentiles across the camera-monitored segment during the time of image acquisition. Finally, the generated labeled image dataset is used to train and thoroughly evaluate a Vision Transformer (ViT) model to predict a discrete transit travel time range (band). The results illustrate that the ViT model is able to learn image features and contents that best help it deduce the expected travel time range with an average validation accuracy ranging between 80%-85%. We assess the interpretability of the ViT model's predictions and showcase how this discrete travel time band prediction can subsequently improve continuous transit travel time estimation. The workflow and results presented in this study provide an end-to-end, scalable, automated, and highly efficient approach for integrating traditional transit data sources and roadside imagery to improve the estimation of transit travel duration. This work also demonstrates the value of incorporating real-time information from computer-vision sources, which are becoming increasingly accessible and can have major implications for improving operations and passenger real-time information.
translated by 谷歌翻译
原产地目的地(O-D)旅行需求预测是运输中的基本挑战。最近,时空深度学习模型展示了提高预测准确性的巨大潜力。但是,很少有研究能够解决细粒O-D矩阵中的不确定性和稀疏问题。这提出了一个严重的问题,因为许多零偏离了确定性深度学习模型的基础的高斯假设。为了解决这个问题,我们设计了一个空间零膨胀的负二项式神经网络(Stzinb-gnn),以量化稀疏旅行需求的不确定性。它使用扩散和时间卷积网络分析空间和时间相关性,然后将其融合以参数化行进需求的概率分布。使用两个具有各种空间和时间分辨率的现实世界数据集对STZINB-GNN进行了检查。结果表明,由于其高精度,紧密的置信区间和可解释的参数,尤其是在高时空分辨率下,Stzinb-GNN比基准模型的优越性。 STZINB-GNN的稀疏参数对各种运输应用具有物理解释。
translated by 谷歌翻译
Person Search aims to simultaneously localize and recognize a target person from realistic and uncropped gallery images. One major challenge of person search comes from the contradictory goals of the two sub-tasks, i.e., person detection focuses on finding the commonness of all persons so as to distinguish persons from the background, while person re-identification (re-ID) focuses on the differences among different persons. In this paper, we propose a novel Sequential Transformer (SeqTR) for end-to-end person search to deal with this challenge. Our SeqTR contains a detection transformer and a novel re-ID transformer that sequentially addresses detection and re-ID tasks. The re-ID transformer comprises the self-attention layer that utilizes contextual information and the cross-attention layer that learns local fine-grained discriminative features of the human body. Moreover, the re-ID transformer is shared and supervised by multi-scale features to improve the robustness of learned person representations. Extensive experiments on two widely-used person search benchmarks, CUHK-SYSU and PRW, show that our proposed SeqTR not only outperforms all existing person search methods with a 59.3% mAP on PRW but also achieves comparable performance to the state-of-the-art results with an mAP of 94.8% on CUHK-SYSU.
translated by 谷歌翻译
Deep learning methods have contributed substantially to the rapid advancement of medical image segmentation, the quality of which relies on the suitable design of loss functions. Popular loss functions, including the cross-entropy and dice losses, often fall short of boundary detection, thereby limiting high-resolution downstream applications such as automated diagnoses and procedures. We developed a novel loss function that is tailored to reflect the boundary information to enhance the boundary detection. As the contrast between segmentation and background regions along the classification boundary naturally induces heterogeneity over the pixels, we propose the piece-wise two-sample t-test augmented (PTA) loss that is infused with the statistical test for such heterogeneity. We demonstrate the improved boundary detection power of the PTA loss compared to benchmark losses without a t-test component.
translated by 谷歌翻译
本文介绍了Omnicity,这是一种从多层次和多视图图像中了解无所不能的城市理解的新数据集。更确切地说,Omnicity包含多视图的卫星图像以及街道级全景图和单视图图像,构成了超过100k像素的注释图像,这些图像是从纽约市的25k Geo-Locations中良好的一致性和收集的。为了减轻大量像素的注释努力,我们提出了一个有效的街景图像注释管道,该管道利用了卫星视图的现有标签地图以及不同观点之间的转换关系(卫星,Panorama和Mono-View)。有了新的Omnicity数据集,我们为各种任务提供基准,包括构建足迹提取,高度估计以及构建平面/实例/细粒细分。我们还分析了视图对每个任务的影响,不同模型的性能,现有方法的局限性等。与现有的多层次和多视图基准相比,我们的Omnicity包含更多具有更丰富注释类型和更丰富的图像更多的视图,提供了从最先进的模型获得的更多基线结果,并为街道级全景图像中的细粒度建筑实例细分介绍了一项新颖的任务。此外,Omnicity为现有任务提供了新的问题设置,例如跨视图匹配,合成,分割,检测等,并促进开发新方法,以了解大规模的城市理解,重建和仿真。 Omnicity数据集以及基准将在https://city-super.github.io/omnicity上找到。
translated by 谷歌翻译
立场检测旨在确定文本的作者是否赞成,反对或中立。这项任务的主要挑战是两个方面的:由于不同目标以及缺乏目标的上下文信息而产生的几乎没有学习。现有作品主要通过设计基于注意力的模型或引入嘈杂的外部知识来解决第二期,而第一个问题仍未探索。在本文中,受到预训练的语言模型(PLM)的潜在能力(PLM)的启发,我们建议介绍基于立场检测的及时基于迅速的微调。 PLM可以为目标提供基本的上下文信息,并通过提示启用几次学习。考虑到目标在立场检测任务中的关键作用,我们设计了目标感知的提示并提出了一种新颖的语言。我们的语言器不会将每个标签映射到具体单词,而是将每个标签映射到矢量,并选择最能捕获姿势与目标之间相关性的标签。此外,为了减轻通过单人工提示来处理不同目标的可能缺陷,我们建议将信息从多个提示中学到的信息提炼。实验结果表明,我们提出的模型在全数据和少数场景中的表现出色。
translated by 谷歌翻译
药物目标亲和力(DTA)预测是药物发现和药物研究的重要任务。 DTA的准确预测可以极大地受益于新药的设计。随着湿实验的昂贵且耗时,DTA预测的监督数据非常有限。这严重阻碍了基于深度学习的方法的应用,这些方法需要大量的监督数据。为了应对这一挑战并提高DTA预测准确性,我们在这项工作中提出了一个具有几种简单但有效的策略的框架:(1)多任务培训策略,该策略将DTA预测和蒙版语言建模(MLM)任务采用配对的药品目标数据集; (2)一种半监督的训练方法,通过利用大规模的未配对分子和蛋白质来赋予药物和靶向代表性学习,这与以前仅利用仅利用预训练的预训练和微调方法,这些方法仅利用前培训和微调方法训练; (3)一个交叉意见模块,以增强药物和靶代表性之间的相互作用。在三个现实世界基准数据集上进行了广泛的实验:BindingDB,Davis和Kiba。结果表明,我们的框架大大优于现有方法,并实现最先进的性能,例如,$ 0.712 $ rmse在bindingdb ic $ _ {50} $测量上,比以前的最佳工作要改善了$ 5 \%。此外,关于特定药物目标结合活动,药物特征可视化和现实世界应用的案例研究证明了我们工作的巨大潜力。代码和数据在https://github.com/qizhipei/smt-dta上发布
translated by 谷歌翻译
Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without the above intermediate values is not fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have closer properties (e.g., HOMO-LUMO gap) with the groundtruth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at https://github.com/DirectMolecularConfGen/DMCG
translated by 谷歌翻译
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
translated by 谷歌翻译
As one of the most important psychic stress reactions, micro-expressions (MEs), are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Thus, recognizing MEs (MER) automatically is becoming increasingly crucial in the field of affective computing, and provides essential technical support in lie detection, psychological analysis and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Despite the recent efforts of several spontaneous ME datasets to alleviate this problem, it is still a tiny amount of work. To solve the problem of ME data hunger, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos induced by 671 participants and annotated by more than 20 annotators throughout three years. Afterwards, we adopt four classical spatiotemporal feature learning models on DFME to perform MER experiments to objectively verify the validity of DFME dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER respectively on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that our DFME dataset can facilitate the research of automatic MER, and provide a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
translated by 谷歌翻译